ABSTRACT

To locate matches across pairs of lists without unique
identifiers, it is sometimes necessary to compare strings of
letters. String comparators are used in production computer matching
software during the Post Enumeration Survey for the 1990 U.S. census.
This paper describes: (1) a string comparator metric that partially
accounts for typographical variation in strings such as first name or
surname; (2) decision rules that use the string comparator; and (3)
improvements in empirical matching results. The string comparator
metric for comparing partially agreeing strings extends the Jaro
string comparator. How general methods of accounting for partial
agreement fit with the Fellegi-Sunter (I. P. Fellegi and A. B.
Sunter, 1969) model of record linkage is described. A formal method
of modeling how to adjust matching weights between pure agreement and
pure disagreement is presented. The procedure is illustrated for
files for which the truth of matches is known. It is demonstrated
that the theoretical rules of Fellegi and Sunter are still valid when
general weighting adjustments accounting for partial agreement are
performed. Eight tables contain illustrative data. (SLD)

ABSTRACT

This paper describes a string comparator metric that 
partially accounts for typographical variation in 
strings such as first name or surname, decision rules 
that utilize the string comparator, and improvements 
in empirical matching results. The string comparators 
are used in production computer matching software 
during the Post Enumeration Survey for the 1990 
Census. The Post Enumeration Survey will use 
capture/recapture and other statistical techniques to 
produce a set of adjusted Census counts. 

1. INTRODUCTION 

Locating matches across a pair of lists not having 
unique identifiers such as a social security number is 
often difficult. Typically available identifiers such as
first name, last name, and various demographic,
economic, or address components may not uniquely
identify matches because of legitimate variations. 

Some types of legitimate variations in identifiers 
generally require a priori knowledge that allows rapid, 
accurate utilization. Such variations might take the 
forms Mrs W M Smith and Elizabeth Smith. 
Typographical variations such as Elizabeth Smith 
versus Elzbath Smoth are a special case of legitimate 
variations. They are more easily dealt with if suitable 
methods of comparing strings are available and are 
the only variations that we will consider in this paper. 

If $s_1$ and $s_2$ are two strings, a string comparator
$\Phi$ merely maps the pair $(s_1, s_2)$ to the closed interval
[0,1]. A string comparator is not necessarily a metric
in the mathematical sense and the restriction of its
range to [0,1] is done primarily for convenience.
Generally, we want pairs of strings that agree exactly
to be assigned value 1, pairs of strings that agree
almost exactly (in some sense) to have values close to
1, and strings that entirely disagree (in some sense) to
have value 0.

A simple example of a string comparator is a 
function that assigns value 1 to a pair of strings that 
agree exactly or agree exactly on a code such as 
Soundex and, otherwise, assigns value 0. Another
example would be a properly normalized Damerau-Levenshtein
metric that accounts for the number of
insertions and deletions it takes to get from one string
to another (see e.g., Winkler 1985).
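
The two example comparators above can be sketched as follows. This is a minimal illustration, not the production code: it uses a plain Levenshtein distance (insertions, deletions, and substitutions) rather than the full Damerau-Levenshtein variant, and dividing by the longer string length is an assumed normalization. All function names are hypothetical.

```python
def exact_comparator(s1: str, s2: str) -> float:
    """Return 1.0 when the strings agree exactly, otherwise 0.0."""
    return 1.0 if s1 == s2 else 0.0


def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(s2) + 1))
    for i, ch1 in enumerate(s1, start=1):
        current = [i]
        for j, ch2 in enumerate(s2, start=1):
            cost = 0 if ch1 == ch2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_comparator(s1: str, s2: str) -> float:
    """Map edit distance into [0, 1] (assumed normalization by max length)."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # two empty strings agree exactly
    return 1.0 - edit_distance(s1, s2) / longest
```

Both functions return 1 for exact agreement and 0 for total disagreement, so either can play the role of a string comparator in the sense defined above.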

This paper provides a class of string comparator
metrics for comparing partially agreeing strings that
extend the Jaro string comparator (see e.g., Winkler
1985). It formally shows how general methods of
accounting for partial agreement fit in with the
Fellegi-Sunter (1969) model of record linkage. It
provides a formal method of modelling how to adjust
matching weights between pure agreement and pure
disagreement. The methods are dependent on having
a representative set of matching pairs.

The second section of the paper consists of four 
parts. The first part provides brief background on the
Fellegi-Sunter model of record linkage. In the second
part, we show how partial agreement relates to
general likelihood ratios and associated
information-theoretic decision rules. The empirical data base is
described in the third part. The fourth part presents
the specific string comparator and methods for
modelling how it is used in adjusting matching 
weights between pure agreement and disagreement. 

The third section contains empirical results based 
on files for which the truth of matches is known. The 
first subsection shows how specific weight adjustment 
curves are modelled for strings such as last name. 
The second subsection contains matching results that 
show the improvements due to string comparators. The
improvements are placed in the context of all
techniques implemented in current production computer
matching software that increase matching efficacy.

The fourth section provides discussion of the 
quality of the empirical data bases used in the
analyses and the limitations of the existing string 
comparator/weight adjustment method. 

The final section is a summary. 

2. BACKGROUND 

2.1. Fellegi-Sunter Model of Record Linkage

The Fellegi-Sunter Model uses a decision-theoretic 
approach establishing the validity of principles first
used in practice by Newcombe (Newcombe et al. 
1959, also 1988). To give an overview, we describe 
the model in terms of ordered pairs in a product 
space. The description closely follows Fellegi and 
Sunter (1969, pp. 1184-1187). 

There are two populations A and B whose 
elements will be denoted by a and b. We assume 
that some elements are common to A and B. 
Consequently the set of ordered pairs 



$$A \times B = \{(a,b) : a \in A,\ b \in B\}$$

is the union of two disjoint sets of matches

$$M = \{(a,b) : a = b,\ a \in A,\ b \in B\}$$

and nonmatches

$$U = \{(a,b) : a \neq b,\ a \in A,\ b \in B\}.$$

The records corresponding to members of A and
B are denoted by $\alpha(a)$ and $\beta(b)$, respectively. The
comparison vector $\gamma$ associated with the records is
defined by:

$$\gamma[\alpha(a),\beta(b)] = \big(\gamma^1[\alpha(a),\beta(b)],\ \gamma^2[\alpha(a),\beta(b)],\ \ldots,\ \gamma^K[\alpha(a),\beta(b)]\big).$$

Each of the $\gamma^i$, $i = 1, \ldots, K$, represents a specific
comparison. For instance, $\gamma^1$ could represent
agreement/disagreement on sex. Also, $\gamma^2$ could
represent the comparison that two surnames agree and
take a specific value or that they disagree.

Where confusion does not arise, the function $\gamma$ on
$A \times B$ will be denoted by $\gamma(\alpha,\beta)$, $\gamma(a,b)$, or $\gamma$. The
set of all possible realizations of $\gamma$ is denoted by $\Gamma$.

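The conditional probability of $\gamma(a,b)$ given
$(a,b) \in M$ is

$$m(\gamma) = P\{\gamma[\alpha(a),\beta(b)] \mid (a,b) \in M\} = \sum_{(a,b) \in M} P\{\gamma[\alpha(a),\beta(b)]\} \cdot P[(a,b) \mid M].$$

Similarly we denote the conditional probability of $\gamma$
given $(a,b) \in U$ by $u(\gamma)$.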

We observe a vector of information $\gamma(a,b)$
associated with pair (a,b) and wish to designate a
pair as a link (denote the decision by $A_1$), a possible
link (decision $A_2$), or a nonlink (decision $A_3$). A
linkage rule L is defined as a mapping from $\Gamma$, the
comparison space, onto a set of random decision
functions $D = \{d(\gamma)\}$ where

$$d(\gamma) = \{P(A_1 \mid \gamma),\ P(A_2 \mid \gamma),\ P(A_3 \mid \gamma)\}, \quad \gamma \in \Gamma,$$

and

$$\sum_{i=1}^{3} P(A_i \mid \gamma) = 1.$$

There are two types of error associated with a
linkage rule. A Type I error occurs if an unmatched
comparison is erroneously linked. It has probability

$$P(A_1 \mid U) = \sum_{\gamma \in \Gamma} u(\gamma)\, P(A_1 \mid \gamma).$$

A Type II error occurs if a matched comparison is
erroneously not linked. It has probability

$$P(A_3 \mid M) = \sum_{\gamma \in \Gamma} m(\gamma)\, P(A_3 \mid \gamma).$$

Fellegi and Sunter (1969) define a linkage rule $L_0$,
with associated decisions $A_1$, $A_2$, and $A_3$, that is
optimal in the following sense:

Theorem (Fellegi-Sunter 1969). Let $L'$ be a
linkage rule with associated decisions $A_1'$, $A_2'$, and
$A_3'$ such that it has the same error probabilities
$P(A_3' \mid M) = P(A_3 \mid M)$ and $P(A_1' \mid U) = P(A_1 \mid U)$ as $L_0$.
Then $L_0$ is optimal in that $P(A_2 \mid U) \le P(A_2' \mid U)$ and
$P(A_2 \mid M) \le P(A_2' \mid M)$.

In other words, if $L'$ is any competitor of $L_0$
having the same Type I and Type II error rates
(which are both conditional probabilities), then the
conditional probabilities (either on set U or M) of
not making a decision under rule $L'$ are always
at least as great as under $L_0$.
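
A three-way decision of this kind is commonly implemented by comparing a total matching weight against two cutoffs. The sketch below is illustrative only: it assumes fields are conditionally independent, so that the logarithm of the ratio $m(\gamma)/u(\gamma)$ is a sum of per-field weights, and the probabilities and cutoffs are hypothetical values, not taken from this paper.

```python
import math


def field_weight(agrees: bool, m: float, u: float) -> float:
    """log2 likelihood ratio for one field.

    m = P(field agrees | pair in M), u = P(field agrees | pair in U).
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1.0 - m) / (1.0 - u))


def total_weight(outcomes, m_probs, u_probs) -> float:
    """Sum of field weights; relies on the assumed independence of fields."""
    return sum(field_weight(a, m, u)
               for a, m, u in zip(outcomes, m_probs, u_probs))


def decide(weight: float, lower: float, upper: float) -> str:
    """Return A1 (link), A2 (possible link), or A3 (nonlink)."""
    if weight > upper:
        return "A1"
    if weight < lower:
        return "A3"
    return "A2"


# Hypothetical m- and u-probabilities for two fields (not from the paper).
M_PROBS = [0.9, 0.8]
U_PROBS = [0.1, 0.2]
```

With cutoffs of -2 and 3, a pair agreeing on both fields would be designated a link, a pair disagreeing on both a nonlink, and a pair agreeing on only the first field a possible link held for review.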

2.2. General Partial Agreement

If the set of matches M were known, then we 
could model how partial agreement affects matching 
weights as follows: 

1. Partition the closed interval [0,1] into a disjoint
   collection of subintervals $(k_i, k_{i+1}]$ for
   $i = 1, \ldots, N$. For convenience, choose $k_i = (i-1)/N$
   and $k_{i+1} = i/N$ for $i = 1, \ldots, N$. If $k_i = 0$, then
   we include 0 in the interval $(k_i, k_{i+1}]$.

2. For each field j and for each $i = 1, \ldots, N$, use the
   ratio

   $$\frac{P(\Phi(\gamma^j(a,b)) \in (k_i, k_{i+1}] \mid M)}{P(\Phi(\gamma^j(a,b)) \in (k_i, k_{i+1}] \mid U)} \qquad (2.1)$$

   as the value of the adjusted weight curve for
   interval $(k_i, k_{i+1}]$. Here $\Phi$ is the string
   comparator, $\gamma^j$ is a comparison of the jth field, (a,b)
   is an arbitrary pair, M is the set of matches, and
   U the set of nonmatches.

3. If, for a fixed field j, the curves (step functions)
   given by (2.1) are approximately the same for
   several data sets, then find a single piecewise
   linear curve as the approximation for each of
   them.

The piecewise linear curve will have the agreement
weight as its highest value and the disagreement 
weight as its lowest value. We use the newly 
estimated curve on files for which the set of matches 
M is not known. 
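
Steps 1 and 2 can be sketched as an empirical estimate from pairs whose match status is known. This is a minimal illustration under stated assumptions: the inputs are non-empty lists of comparator values for one field, the probabilities in (2.1) are estimated by simple relative frequencies, and an interval with no nonmatches is reported as undefined (`None`). The function names are hypothetical.

```python
import math
from collections import Counter


def interval_index(phi: float, n: int) -> int:
    """Index i in 1..n with phi in (k_i, k_{i+1}], where k_i = (i-1)/n.

    The value 0 is placed in the first interval.
    """
    if phi <= 0.0:
        return 1
    # round() guards against floating-point noise such as 0.62 * 50
    return min(n, math.ceil(round(phi * n, 9)))


def weight_curve(match_values, nonmatch_values, n):
    """Estimate ratio (2.1) on each of the n subintervals of [0, 1]."""
    m_counts = Counter(interval_index(v, n) for v in match_values)
    u_counts = Counter(interval_index(v, n) for v in nonmatch_values)
    curve = {}
    for i in range(1, n + 1):
        p_m = m_counts[i] / len(match_values)       # estimate of P(. | M)
        p_u = u_counts[i] / len(nonmatch_values)    # estimate of P(. | U)
        curve[i] = p_m / p_u if p_u > 0 else None   # undefined if no nonmatches
    return curve
```

The resulting dictionary is a step function over the subintervals; step 3 then replaces several such step functions by one piecewise linear curve.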

If the string comparators and associated methods 
for modelling ratios (2.1) are reasonably accurate, 
then the resultant decision rules are optimal in the
sense of Fellegi and Sunter (1969, Theorem).

2.3. Empirical Data Bases 

The empirical files are the 1988 Dress Rehearsal
Census and Post Enumeration Survey (PES). The
geographic regions consisted of portions of St. Louis,
MO, Columbia, MO, and rural Washington state.

Fields available for matching are first name, middle
initial, last name, house number, street name, rural
route number, postal box number, conglomerated
address, telephone number, age, sex, marital status,
relationship to head of household, and race.

Individuals in the PES are generally only computer 
matched with those individuals in the Census that are 
in the same block cluster. A block cluster may 
consist of a Census block or several blocks. 

2.4. String Comparator Metrics

Jaro (see e.g., Winkler 1985, 1989, Winkler and 
Thibaudeau 1990) introduced a string comparator 
measure that gives values of partial agreement 
between two strings. The string comparator accounts 
for length of strings and partially accounts for the 
types of errors typically made in alphanumeric strings 
by human beings. It is used in adjusting exact 
agreement weights when two strings do not agree on 
a character-by-character basis.

Specifically, if c > 0, the Jaro string comparator is

$$\Phi = W_1 \cdot c/d + W_2 \cdot c/r + W_t \cdot (c - \tau)/c,$$

where

   $W_1$  = weight associated with characters in
            the first of two files,
   $W_2$  = weight associated with characters in
            the second of two files,
   $W_t$  = weight associated with transpositions,
   d      = length of string in first file,
   r      = length of string in second file,
   $\tau$ = number of transpositions of
            characters, and
   c      = number of characters in common in
            pair of strings.

If c = 0, then $\Phi = 0$.

Two characters are considered in common only if
they are no further apart than
$(m/2 - 1)$ where m = max(d,r). Characters in
common from two strings are assigned; remaining
characters are unassigned. Each string has the same
number of assigned characters.

The number of transpositions is computed as
follows: The first assigned character on one string
is compared to the first assigned character on the
other string. If the characters are not the same, half
of a transposition has occurred. Then the second
assigned character on one string is compared to the
second assigned character on the other string, etc.
The number of mismatched characters is divided by
two to yield the number of transpositions.

If two strings agree on a character-by-character
basis, then the Jaro string comparator $\Phi$ is set to
$W_1 + W_2 + W_t$, which is the maximum value that $\Phi$
can assume. The minimum value that $\Phi$ can
assume is 0, which occurs when the two strings
have no characters in common (subject to the above
definition of common).

For present matching applications, $W_1$, $W_2$, and
$W_t$ are arbitrarily set to 1/3. The new string
comparator metric basically modifies the basic string
comparator according to whether the first few
characters in the strings being compared agree.
Specifically, for i = 1, 2, 3, 4,

$$\Phi_n = \Phi + i \cdot 0.1 \cdot (1 - \Phi)$$

if the first i characters agree.

If $w_a$ and $w_d$ are the estimated agreement and
disagreement weights for a specific field, respectively,
then the Jaro adjusted matching weight $w_{am}$ used in
the total weight calculation is given by

$$w_{am} = \begin{cases} w_a & \text{if } \Phi = 1, \text{ and} \\ \max\{w_a - (w_a - w_d) \cdot (1 - \Phi) \cdot (9/2),\ w_d\} & \text{if } 0 \le \Phi < 1. \end{cases}$$

The constant 9/2 controls how quickly decreases
in partial agreement values force the adjusted weight
to the disagreement weight.
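
The definitions in this subsection can be sketched in code as follows. This is an illustrative reading of the formulas as written, not the production implementation: characters are assigned greedily from left to right, the search distance is taken as max(d, r) // 2 - 1 with integer division (an assumption), and all function names are hypothetical. No claim is made that the sketch reproduces every entry of Table 1.

```python
def jaro(s1: str, s2: str, w1=1/3, w2=1/3, wt=1/3) -> float:
    """Basic Jaro comparator: w1*c/d + w2*c/r + wt*(c - tau)/c."""
    d, r = len(s1), len(s2)
    if d == 0 or r == 0:
        return 0.0
    # Characters count as common only within this distance (assumed rounding).
    search = max(max(d, r) // 2 - 1, 0)
    used2 = [False] * r
    assigned1 = []
    for i, ch in enumerate(s1):
        for j in range(max(0, i - search), min(r, i + search + 1)):
            if not used2[j] and s2[j] == ch:
                used2[j] = True
                assigned1.append(ch)
                break
    c = len(assigned1)
    if c == 0:
        return 0.0
    assigned2 = [s2[j] for j in range(r) if used2[j]]
    # Each mismatched position among assigned characters is half a transposition.
    tau = sum(a != b for a, b in zip(assigned1, assigned2)) / 2
    return w1 * c / d + w2 * c / r + wt * (c - tau) / c


def prefix_adjusted(s1: str, s2: str) -> float:
    """New comparator: phi + i * 0.1 * (1 - phi), i = agreeing leading chars (max 4)."""
    phi = jaro(s1, s2)
    i = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        i += 1
    return phi + i * 0.1 * (1.0 - phi)


def jaro_adjusted_weight(phi: float, w_agree: float, w_disagree: float) -> float:
    """Jaro adjusted matching weight with the constant 9/2."""
    if phi >= 1.0:
        return w_agree
    return max(w_agree - (w_agree - w_disagree) * (1.0 - phi) * 4.5, w_disagree)
```

For a hypothetical field with agreement weight 4 and disagreement weight -4, a comparator value of 0.9 would give an adjusted weight of 0.4, and any value at or below about 0.78 would already receive the full disagreement weight.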

Instead of assuming that the same adjustment
procedure works for different fields such as first
name, last name, and house number, procedures for
modelling the weight adjustment as a piecewise linear
function were developed. The procedures necessitate
having representative sets of pairs for which the truth
of matches is known. The new adjusted weights $w_{na}$
take the form

$$w_{na} = \begin{cases} w_a & \text{if } \Phi_n \ge b_1, \\ \max\{w_a - (w_a - w_d) \cdot (1 - \Phi_n) \cdot (a_1),\ w_d\} & \text{if } b_2 \le \Phi_n < b_1, \\ \max\{w_a - (w_a - w_d) \cdot (1 - \Phi_n) \cdot (a_2),\ w_d\} & \text{if } \Phi_n < b_2. \end{cases} \qquad (2.2)$$

The constants $a_1$, $a_2$, $b_1$, and $b_2$ depend
on the specific type of string (such as first name) to which
the adjustment is applied.
Generally, $a_1 < a_2$. The specific constants used are
given in Table 4 of section 3.1.
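
The piecewise adjustment (2.2) can be sketched with the constants of Table 4. This is a minimal illustration: the dictionary keys and function name are hypothetical, and the agreement and disagreement weights passed in are placeholders that would come from whatever parameter estimation is in use.

```python
# (a1, a2, b1, b2) per field, copied from Table 4.
CONSTANTS = {
    "first": (1.5, 3.0, 0.92, 0.75),
    "last": (3.0, 4.5, 0.96, 0.88),
    "house": (4.5, 7.5, 0.98, 0.83),
}


def new_adjusted_weight(phi_n: float, w_agree: float, w_disagree: float,
                        field: str) -> float:
    """Adjusted weight of form (2.2) for a comparator value phi_n."""
    a1, a2, b1, b2 = CONSTANTS[field]
    if phi_n >= b1:
        return w_agree                      # treated as full agreement
    slope = a1 if phi_n >= b2 else a2       # steeper downweighting below b2
    adjusted = w_agree - (w_agree - w_disagree) * (1.0 - phi_n) * slope
    return max(adjusted, w_disagree)        # never below full disagreement
```

For example, with hypothetical weights of 4 and -4 for first name, a comparator value of 0.80 falls in the middle branch and gives 4 - 8(0.2)(1.5) = 1.6.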

Table 1 provides examples of string comparator
values for pairs of last names and for pairs of first
names. The abroms-abrams example with string
comparator value .9333 in contrast to the
lampley-campley example with value .9048 shows that the string
comparator gives a higher value to the pair that
differs by a single character further from the first
position. The martha-marhta example with value
.9667 in contrast to the jonathon-jonathan example
with value .9583 shows that transposition of two
characters causes less of a downweighting than
differing by one character.
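
As a check on the arithmetic, two Table 1 entries whose first characters disagree (so that no prefix adjustment applies) can be worked directly from the basic Jaro formula with all three weights equal to 1/3. Here c is the number of characters in common, d and r are the string lengths, and $\tau$ is the number of transpositions; the character counts below are worked out by hand for this illustration.

```latex
% lampley / campley: d = r = 7, c = 6, \tau = 0
\Phi = \tfrac{1}{3}\cdot\tfrac{6}{7} + \tfrac{1}{3}\cdot\tfrac{6}{7}
     + \tfrac{1}{3}\cdot\tfrac{6-0}{6} = \tfrac{19}{21} \approx 0.9048

% galloway / calloway: d = r = 8, c = 7, \tau = 0
\Phi = \tfrac{1}{3}\cdot\tfrac{7}{8} + \tfrac{1}{3}\cdot\tfrac{7}{8}
     + \tfrac{1}{3}\cdot\tfrac{7-0}{7} = \tfrac{11}{12} \approx 0.9167
```

Both results agree with the values .9048 and .9167 listed in Table 1.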



Table 1. Values of String Comparator

  shackleford    shackelford    .9848
  cunningham     cunnigham      .9833
  campell        campbell       .9792
  nichleson      nichulson      .9630
  massey         massie         .9444
  abroms         abrams         .9333
  galloway       calloway       .9167
  lampley        campley        .9048
  dixon          dickson        .8533

  frederick      fredric        .9815
  michele        michelle       .9792
  jesse          jessie         .9722
  marhta         martha         .9667
  jonathon       jonathan       .9583
  julies         juluis         .9333
  jeraldine      geraldine      .9246
  yvette         yevett         .9111
  tanya          tonya          .8933
  dwayne         duane          .8578


3. RESULTS





The results of calculating the ratio (2.1) for various
values of $\Phi_n$ for the first name are given in Table 2
and for the last name in Table 3. The weights in the
last three columns correspond to disjoint intervals of
the form $(k_i, k_{i+1}]$ where $k_{i+1}$ is given in column one.
Within a table, we observe that each of the curves has
roughly the same shape and the same starting and
ending values.

Using the constants associated with first name from
Table 4, the piecewise linear curve (2.2) for first
name approximates each of the weighting curves (step
functions) in Table 2. The constants and curves
associated with other fields such as last name and
house number are obtained in a similar manner.



Table 2. String Comparator Values and Weights
         First Name

                  -------- Weights --------
  Value            StL       Col      Wash
  0.62           -4.52        NA     -3.16
  0.64           -3.13     -3.40     -3.06
  0.66           -2.87     -1.91     -1.38
  0.68           -2.44     -2.50     -2.39
  0.70           -0.92     -1.53     -2.08
  0.72           -1.02     -1.61     -0.43
  0.74            0.14     -0.19      0.28
  0.76           -0.22     -0.96     -0.17
  0.78            0.88      0.27      2.05
  0.80            0.83      0.63      0.84
  0.82            2.10      2.14      2.09
  0.84            2.25      1.72      2.42
  0.86            2.31      2.93      3.78
  0.88            3.05      2.53      3.41
  0.90            3.46      3.19      2.73
  0.92            3.77      3.09      3.58
  0.94            4.27      3.56      3.31
  0.96            4.42      5.03      4.58
  0.98            5.48      5.04      4.34
  1.00            4.62      4.58      4.86




Table 3. String Comparator Values and Weights
         Last Name

                  -------- Weights --------
  Value            StL       Col      Wash
  0.62           -5.35     -5.57        NA
  0.64           -5.18     -4.28        NA
  0.66           -5.21        NA        NA
  0.68           -3.88     -4.38     -3.19
  0.70              NA        NA        NA
  0.72           -4.18        NA     -3.70
  0.74           -3.?6     -2.95     -2.23
  0.76           -1.64     -3.88     -1.56
  0.78           -1.53     -2.85      0.11
  0.80           -1.49     -1.20     -0.92
  0.82           -0.47     -0.65      0.42
  0.84            0.02      0.45     -0.46
  0.86            0.10      0.36      0.04
  0.88            1.08      1.07      0.31
  0.90            1.08      1.00      1.43
  0.92            1.14      0.69      1.33
  0.94            1.13      1.40      1.40
  0.96            1.29      1.22      1.11
  0.98            1.63      1.52      1.70
  1.00            1.35      1.08      0.70

Table 4. Constants Used in Piecewise Linear Weight
         Adjustments

  Field        $a_1$    $a_2$    $b_1$    $b_2$
  first         1.5      3.0      .92      .75
  last          3.0      4.5      .96      .88
  house #       4.5      7.5      .98      .83



Weight adjustments are only performed for values
of $\Phi_n$ greater than 0.60. Values below 0.60 are
generally associated with pairs of strings associated
with nonmatches in U.

The Jaro weight adjustment is used for the street
field and any other fields that were not modelled.
The street field weighting adjustment was modelled in
a manner similar to the last name, first name, and
house numbers. The Jaro weighting adjustment is
conservative because it generally downweights more
severely than the new curves and, thus, has less of a
tendency to assign greater than the full disagreement
weight to disagreeing strings.

3.2. Matching Comparison

A comparison of matching results is given in
Tables 5, 6, and 7 for St. Louis, Columbia, and
Washington, respectively. To understand the tables,
we need to describe the types of matching procedures.
The simplest procedure, crude, merely uses an ad hoc
guess for matching parameters and does not use string
comparators.

The next, param, does not use string comparators
but does estimate the probabilities $m(\gamma)$ and $u(\gamma)$.
Such probabilities are often estimated through an
iterative procedure that involves manual review of
matching results and successive reuse of the
reestimated parameters. The third type, param2, uses
the same probabilities as param and the basic string
comparators.

The fourth type, em, uses an EM-Algorithm for
estimating matching parameters (see e.g., Winkler
1988, Thibaudeau 1990) and uses the basic Jaro string
comparator. The fifth type, em2, uses the EM-derived
weights and the new string comparator and
new weight adjustments. The final type, freq,
replaces simple agree/disagree weights for first name
and last name with frequency-based weights (see e.g.,
Winkler 1989) and also makes adjustments for joint
dependencies of agreements on first name, sex, and
age.

In each table, the number of matches is determined
by a false match rate of 0.002. The crude and param
types are allowed to rise slightly above the 0.002
level because they generally have higher error levels.

By examining the tables we observe that a dramatic
improvement in matches can occur when string
comparators are first used (from param to param2).
The basic reason is that disagreements (on a
character-by-character basis) are replaced by partial
agreements. Improvements due to the new string
comparators and weighting adjustments (from em to
em2) are quite minor.


Table 5. Computer Categories, Various Procedures
         10291 True Matches
         12072 Records, St. Louis
         Pairs Agree on Cluster and
         First Character Surname 1/

               ------- computer designation -------
                    match              clerical
  truth->      match/nonmatch      match/nonmatch

  crude           310/  1            9344/794
  param          7899/ 16            1863/198
  param2         9276/ 23             545/191
  em             9587/ 23             271/192
  em2            9639/ 24             215/189
  freq           9801/ 24              52/ 94

  1/ Approximately 400 true matches
     disagree on first character of
     surname and are not eligible
     for inclusion in the table.
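
As a reading aid, the Table 5 counts can be turned into the share of true matches that each procedure designates as a match without clerical review, together with the size of the clerical workload. The sketch below only rearranges the published counts; using the 10291 true matches from the table title as the denominator is a choice made for this illustration, and the function names are hypothetical.

```python
TRUE_MATCHES = 10291  # from the Table 5 title

# procedure: (match/true match, match/true nonmatch,
#             clerical/true match, clerical/true nonmatch)
TABLE_5 = {
    "crude":  (310, 1, 9344, 794),
    "param":  (7899, 16, 1863, 198),
    "param2": (9276, 23, 545, 191),
    "em":     (9587, 23, 271, 192),
    "em2":    (9639, 24, 215, 189),
    "freq":   (9801, 24, 52, 94),
}


def auto_match_share(procedure: str) -> float:
    """Fraction of all true matches designated 'match' by the computer."""
    designated_true_matches = TABLE_5[procedure][0]
    return designated_true_matches / TRUE_MATCHES


def clerical_workload(procedure: str) -> int:
    """Number of pairs left for clerical review."""
    _, _, clerical_match, clerical_nonmatch = TABLE_5[procedure]
    return clerical_match + clerical_nonmatch


if __name__ == "__main__":
    for name in TABLE_5:
        print(f"{name:7s} {auto_match_share(name):6.1%} {clerical_workload(name):6d}")
```

Read this way, the step from param to param2 raises the automatically matched share from roughly 77 percent to roughly 90 percent of the true matches, while the step from em to em2 changes it by well under one percentage point.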



Table 6. Computer Categories, Various Procedures
         6984 True Matches
         7649 Records, Columbia
         Pairs Agree on Cluster and
         First Character Surname 1/

               ------- computer designation -------
                    match              clerical
  truth->      match/nonmatch      match/nonmatch

  crude          2429/  7            4327/119
  param          6449/ 22             327/ 92
  param2         6655/ 13             135/ 35
  em             6719/ 13              78/ 22
  em2            6762/ 13              37/ 20
  freq           6792/ 11               6/  9

  1/ Approximately 180 true matches
     disagree on first character of
     surname and are not eligible for
     inclusion in the table.

Table 7. Computer Categories, Various Procedures
         1950 True Matches
         2214 Records, Washington
         Pairs Agree on Cluster and
         First Character Surname 1/

               ------- computer designation -------
                    match              clerical
  truth->      match/nonmatch      match/nonmatch

  crude          1307/                564/ 98
  param          1250/                614/ 88
  param2         1765/                134/ 41
  em             1749/                149/ 29
  em2            1795/                107/ 29
  freq           1892/                  7/  9

  1/ Approximately 40 true matches
     disagree on first character of
     surname and are not eligible
     for inclusion in the table.



4. DISCUSSION 

4.1. Quality of Empirical Data Bases 

Because of the relatively large number of 
identifying fields for matching, all results in section 
3.2 are relatively better than might be expected in 
general matching applications. Also, having two key 
fields such as first name and last name with
typographical variation sufficiently severe for 
assignment of full disagreement weight to a true 
match is very rare (below 0.1 percent). The data are, 
however, representative of the type of data that will 
be encountered during the 1990 Post Enumeration 
Survey. 

The data are suitable for evaluating matching
procedures because essentially all matches were found 
and correctly identified. The identification is with 
codes specifying to which record a record is matched. 
All basic identifying information was carefully 
checked and rechecked. In particular, no matches 
were found among the set of code-identified 
nonmatches using a variety of procedures. 

4.2. General String Comparator Metrics 

For matching applications of files having 
significantly different characteristics (i.e., matching
fields) from those of the files of this paper, string 
comparator weighting adjustments may have to be 
remodelled. 

In all matching situations, it seems likely that 
modelling partial agreement should improve matching 
efficacy because the proportions of exact agreement 
on key matching fields can be quite low. For the 
files of this paper, the proportions of true matches
agreeing on a character-by-character basis ($\Phi_n = 1.0$)
are approximately 76 percent for first name and
approximately 86 percent for last name (Table 8).



Table 8. Proportional Agreement by
         String Comparator Values
         Key Fields by Geography

                          StL      Col      Wash
  First
    $\Phi_n$ = 1.0        0.75     0.82     0.75
    $\Phi_n$ > 0.6        0.93     0.94     0.93

  Last
    $\Phi_n$ = 1.0        0.85     0.88     0.86
    $\Phi_n$ > 0.6        0.95     0.96     0.96



5. SUMMARY 

This paper contains a new string comparator that 
partially accounts for minor typographical variation 
when two strings are compared. The theoretical 
decision rules of Fellegi and Sunter (1969) are still 
valid when general weighting adjustments accounting 
for partial agreement are performed. 

*This paper reports general results of research by 
Census Bureau staff. The views expressed are 
attributable to the author and do not necessarily
reflect those of the Census Bureau. 

END

U.S. Dept. of Education
Office of Educational Research and Improvement (OERI)

ERIC

Date Filmed
March 29, 1991



